Cassandra


Distributed database with user-configurable partitioning. 


Local data organized as a log-structured merge tree.


LSM-trees — 10,000 feet view 


• Data is split among multiple components.

• Reads have to consult a subset or all of them to extract data.

• The components have some structure that defines how hard reads are.

• The structure is maintained by moving data between components when certain limits are reached.

[Figure: users, write, Component 1, optimization, Component 2.]


LSM-tree 


• Writes are first sent to an in-memory buffer, the memtable (+ commit log for durability).

• When memtable memory is full, the memtable is flushed to an on-disk sstable.

• Multiple sstables are compacted together depending on the rules specified by a compaction strategy.

• Reads consult memtables and sstables.

[Figure: write to memtable + commit log; flush to Level 0 SSTable; compaction to Level 1 SSTable.]
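The write, flush, compact and read steps above can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions: the class name `TinyLSM`, the entry-count limits, and the "compact everything" rule are hypothetical and are not how Cassandra implements any of this.

```python
class TinyLSM:
    """Toy LSM-tree: memtable + commit log in memory, 'sstables' as sorted dicts."""

    def __init__(self, memtable_limit=2, compaction_threshold=3):
        self.memtable_limit = memtable_limit              # entries before a flush
        self.compaction_threshold = compaction_threshold  # sstables before compaction
        self.memtable = {}
        self.commit_log = []
        self.sstables = []  # newest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability record
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Turn the memtable into an immutable, sorted sstable.
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()
        if len(self.sstables) >= self.compaction_threshold:
            self.compact()

    def compact(self):
        # Merge oldest to newest so that newer values overwrite older ones.
        merged = {}
        for table in reversed(self.sstables):
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]

    def read(self, key):
        # Consult the memtable first, then sstables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            if key in table:
                return table[key]
        return None
```

The number of sstables that `read` may have to visit is what the later slides call read amplification; `compact` trades extra writes for fewer sstables to visit.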


Compaction 


Reorganizes data to make it easier to read. 
Generally, balances between 


• the number of sstables read per query (read amplification)

• the number of times a piece of data is written and rewritten (write amplification)


Different strategies balance between the two. 


Read amplification 


Number of sstables read per query.

Reduced by:

• Bloom filters (key-value workloads).

• Splitting sstables by time order (time-series workloads).

• Splitting sstables by partition order (most queries).

[Figure: reads consulting the memtable + commit log, the Level 0 SSTable and the Level 1 SSTable.]


Write amplification 


Number of times a piece of data is written and 
rewritten. 


Includes commit log and flush, but dominated by 
compaction. 


Only reflected in write latency and throughput 
under sustained load. 


[Figure: users write to the memtable + commit log; compaction reads and merges Level 0 and Level 1 SSTables.]


Cassandra's main compaction strategies 


Size-tiered (STCS) 


Groups SSTables close in size in levels. 
Overlapping SSTables on level. 
Compacts when number of SSTables on 
level is at threshold or above. 


Compacts all SSTables on level. 


No SSTable splitting. 
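The size-tiered rules above can be sketched as follows. This is a simplified illustration, not Cassandra's actual STCS code: the parameter names `bucket_low`, `bucket_high` and `threshold` are borrowed from the STCS options (defaults 0.5, 1.5 and 4), but the grouping loop itself is an assumption made for the example.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            # An SSTable joins a bucket if it is close to the bucket's average size.
            if average * bucket_low <= size <= average * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, threshold=4):
    """Buckets that hold at least `threshold` SSTables get compacted as a whole."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= threshold]
```

For sizes `[10, 11, 9, 12, 100, 105, 1000]` only the four small SSTables form a bucket that reaches the threshold, so they would be compacted together into one larger SSTable.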


Levelled (LCS) 


Explicitly tracks SSTable level. 


Non-overlapping SSTable run on each level
(except L0).

Compacts when size of run is above 
threshold. 

Selects an SSTable to compact with 
overlapping ones in next level. 

Splits on size. 


Target state of the compaction hierarchy 


Size-tiered (STCS)
• SSTables grouped in levels by powers of the threshold.
• Fewer than threshold-many SSTables on each level.

Levelled (LCS)
• One non-overlapping SSTable run on each level.
• Size of level below power of fan factor.

Equivalently, for LCS:
• SSTable runs grouped in levels by powers of the fan factor.
• At most one SSTable run on each level.

Compaction strategies in academia 


Tiered
• Multiple overlapping SSTables per level at rest.
• One compaction to promote to next level.
• Low write and high read amplification.

Levelled
• One SSTable per level at rest.
• Multiple compactions to promote to next level.
• Low read and high write amplification.

Common to both
• Levels grow by a specified fan factor.
• Splitting (vertical/sharding or horizontal/runs) is an orthogonal concern.

(Deletions/overwrites are ignored in descriptions.)
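To make the tiered versus levelled trade-off concrete, here is a rough back-of-the-envelope model. It is an assumption for illustration, not a measurement: read amplification is taken as the number of SSTables that can sit on all levels at rest, and write amplification as the number of times a piece of data is rewritten while it climbs the levels. The function names and the `threshold` parameter (maximum SSTables per level plus one) are hypothetical.

```python
def level_count(total_size, flush_size, fan_factor):
    """Levels needed when each level holds fan_factor times more data than the previous one."""
    levels = 1
    capacity = flush_size * fan_factor
    while capacity < total_size:
        capacity *= fan_factor
        levels += 1
    return levels


def amplification(total_size, flush_size, fan_factor, threshold):
    """Rough (read, write) amplification; threshold = fan_factor is tiered, threshold = 2 is levelled."""
    levels = level_count(total_size, flush_size, fan_factor)
    read_amp = (threshold - 1) * levels
    write_amp = (fan_factor - 1) / (threshold - 1) * levels
    return read_amp, write_amp
```

With a fan factor of 10 and data 1000 times the flush size, the model gives three levels; the tiered setting (`threshold=10`) yields high read and low write amplification, and the levelled setting (`threshold=2`) yields the reverse.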


Basic unified compaction strategy 


• Levels determined by SSTable size.

• Per-level size grows by a specified fan factor f ≥ 2.

• Up to t - 1 SSTables per level at rest, where t = 2 (levelled) or t = f (tiered).

• Result moves up a level when it grows enough.
• Configurable read and write amplification via one integer scaling parameter w:
    ◦ f = 2 + w, t = f when w > 0 (tiered)
    ◦ f = 2 - w, t = 2 when w < 0 (levelled)

[Figure: Read vs write amplification. Total RA and compaction WA plotted against w, for w from -30 to 10.]
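The mapping from the scaling parameter w to the fan factor f and threshold t can be written out directly. This is a minimal sketch; the function name is hypothetical, and treating w = 0 as "L2" follows the labelling used in the per-level examples of this deck (f = t = 2 either way).

```python
def fan_and_threshold(w):
    """Return (f, t, label) for an integer scaling parameter w."""
    if w > 0:
        f = 2 + w
        return f, f, f"T{f}"   # tiered: threshold equals the fan factor
    f = 2 - w
    return f, 2, f"L{f}"       # levelled: threshold is always 2


labels = [fan_and_threshold(w)[2] for w in (4, 2, 0, -2)]
```

Here `labels` comes out as T6, T4, L2, L4, which matches the second per-level example in the deck.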


Per-level scaling parameters 


Separate value of w for each level.

For example:
• 2, 2, 2, -8 / T4, T4, T4, L10
• 4, 2, 0, -2 / T6, T4, L2, L4

Note:
• w can be specified as an integer or in the T/L form.
• The last value applies to all higher levels.


Better ability to accept bursts of data and still keep it 
well organized for reads. 


Especially helpful when the table uses a lot of deletions. 


Sharding 


Splitting SSTables at predefined positions.
• Can be achieved via data directories.
• Can be used to easily scale STCS or basic UCS by ~10x.

Reduces space overhead.

Adds parallelism.

Many small SSTables on lower levels.

Moving boundaries is hard.

[Figure: SSTables split at fixed token boundaries, e.g. 0 - 1000000 and 1000000 - 2000000.]


Density and overlap

Density: size of SSTable divided by token share.
• Grows when SSTables are bigger in size.
• Also grows when SSTables are split.

Overlap: only count overlapping SSTables towards threshold.
• Overlap drives read amplification.
• Compaction buckets can be formed from overlap sections (transitively extended).

[Figure: four 100 MiB SSTables, each covering a quarter of the token range (density 4x100 MiB), next to one 400 MiB SSTable covering the whole range (density 400 MiB).]
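The density definition is a single division, sketched below. The function name and the use of a fraction between 0 and 1 for the token share are assumptions made for this example.

```python
def density(size_mib, token_share):
    """Density = SSTable size divided by the share of the token range it covers."""
    return size_mib / token_share


whole = density(400, 1.0)      # one 400 MiB SSTable covering the full range
quarter = density(100, 0.25)   # one of four 100 MiB SSTables, each covering a quarter
```

Both values are 400 MiB, so splitting an SSTable into shards does not change the level it is assigned to.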


Unified compaction strategy 


• Levels determined by SSTable density.
• Per-level density grows by a specified fan factor f ≥ 2.
• Up to t - 1 overlapping SSTables per level at rest, where t = 2 (levelled) or t = f (tiered).
• Result moves up a level when it grows enough.
• Configurable read and write amplification via one integer scaling parameter w:
    ◦ f = 2 + w, t = f when w > 0 (tiered)
    ◦ f = 2 - w, t = 2 when w < 0 (levelled)


UCS sharding scheme 


When compaction starts: 


• Calculate expected result density.

• Define boundaries to split into SSTables of close to target size t.

• Only split in the middle; i.e. total number of shards is a power-of-2 multiple of base shard count b.

• Any boundaries also apply to all higher densities.
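One way to read the rules above is the sketch below. It assumes the shard count is the base count b multiplied by the power of two nearest (on a log scale) to density / (t * b); that rounding rule is an assumption, chosen because it reproduces the shard counts quoted in the progression example of this deck. Function names are hypothetical.

```python
import math
from fractions import Fraction


def shard_count(density, target_size, base_shards):
    """Number of shards for a compaction result of the given density."""
    if density <= target_size * base_shards:
        return base_shards
    exponent = round(math.log2(density / (target_size * base_shards)))
    return base_shards * 2 ** exponent


def shard_boundaries(count):
    """Split positions as fractions of the token range, always cutting in the middle."""
    return [Fraction(i, count) for i in range(1, count)]
```

Because shard counts only ever double, every boundary used for 4 shards is also a boundary for 8 shards, which is what lets boundaries carry over to all higher densities.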


[Figure: SSTable size and shard count as functions of density (100 MB to 10 TB). Size vs shard count for t = 1 GiB and b = 4.]


UCS progression example

Parameters:
• Base SSTable size m = 50 MiB
• Scaling parameters T4, T3, L2, L4
• Target SSTable size t = 100 MiB
• Base shard count b = 1

Levels by density:
• L0 @ T4: 0-200 MiB
• L1 @ T3: 200-600 MiB
• L2 @ T2: 600-1200 MiB
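The level ranges in this example follow from multiplying the base size m by each level's fan factor in turn. The sketch below reproduces them; the function names are hypothetical, and reusing the last fan factor for any further levels is an assumption of this sketch.

```python
def level_bounds(base_size, fan_factors):
    """Upper density bound of each level: m * f0, m * f0 * f1, ..."""
    bounds = []
    upper = base_size
    for fan in fan_factors:
        upper *= fan
        bounds.append(upper)
    return bounds


def level_for_density(density, base_size, fan_factors):
    """Index of the level whose density range contains `density`."""
    upper = base_size
    level = 0
    while True:
        # Assumption: the last fan factor repeats for levels beyond the list.
        fan = fan_factors[min(level, len(fan_factors) - 1)]
        upper *= fan
        if density < upper:
            return level
        level += 1


# T4, T3, L2, L4 have fan factors 4, 3, 2, 4; m = 50 MiB.
bounds = level_bounds(50, [4, 3, 2, 4])
```

`bounds` starts with 200, 600 and 1200 MiB, matching the L0, L1 and L2 ranges above.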


Compaction triggered, calculate shard count:
400 MiB with token range 1 → 4 shards.

Write from beginning, splitting on each boundary.

Each resulting SSTable has density 400 MiB.


Delete sources. Shard boundaries no longer relevant.

New set of sources, calculate shards:
240 MiB with token range 1 → 2 shards.

Switch writer once.

[Figure: the four existing 100 MiB SSTables alongside two new 120 MiB SSTables.]

Delete sources. Shard boundaries no longer relevant.


Identify overlap sections: AE, BE, CF, DFG.
DFG triggers threshold.

Extend DFG bucket for overlaps of F to CDFG.
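The overlap sections and the bucket extension in this step can be sketched as below. The token ranges assigned to A through G are hypothetical, chosen only so that the sections come out as AE, BE, CF, DFG; the function names and the fully transitive extension loop are likewise assumptions of this sketch.

```python
def overlap_sections(sstables):
    """sstables maps name -> (start, end) half-open token range.

    Returns, for each stretch between consecutive boundaries, the names covering it.
    """
    points = sorted({p for start, end in sstables.values() for p in (start, end)})
    sections = []
    for lo, hi in zip(points, points[1:]):
        names = sorted(n for n, (s, e) in sstables.items() if s < hi and e > lo)
        if names:
            sections.append(names)
    return sections


def extend_bucket(seed, sections):
    """Grow a bucket with every section that shares an SSTable with it, transitively."""
    bucket = set(seed)
    changed = True
    while changed:
        changed = False
        for names in sections:
            if bucket & set(names) and not set(names) <= bucket:
                bucket |= set(names)
                changed = True
    return sorted(bucket)


# Hypothetical layout: A-D are quarter shards, E and F are half shards, G overlaps D.
tables = {"A": (0, 25), "B": (25, 50), "C": (50, 75), "D": (75, 100),
          "E": (0, 50), "F": (50, 100), "G": (75, 100)}
sections = overlap_sections(tables)
threshold = 3  # L1 uses T3 in this example
seed = next(s for s in sections if len(s) >= threshold)
bucket = extend_bucket(seed, sections)
```

With this layout only the DFG section reaches the threshold, and because F also overlaps C the bucket grows to C, D, F, G.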


430 MiB in ½ space → 860 MiB density, 8 shards.


2 SSTables with density 1080 MiB.


Delete source SSTables.

Shard and bucket boundaries no longer relevant.
Compaction prioritization


When compaction is late and there are too many compactions waiting, 
prioritize ones that reduce read amplification most: 


• Buckets with the highest overlap.
• On equal overlap, prefer lower level.
• On equal overlap and same level, choose randomly.


Avoids accumulation of SSTables on any level and should lead to a stable state 
able to handle sustained load. 
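The ordering described above amounts to a three-part sort key. The sketch below assumes each pending compaction is a dict with hypothetical `overlap` and `level` fields; the random component only matters for ties on both.

```python
import random


def prioritize(buckets, rng=random):
    """Order pending compactions: highest overlap, then lowest level, then random."""
    return sorted(buckets, key=lambda b: (-b["overlap"], b["level"], rng.random()))


pending = [
    {"name": "x", "overlap": 3, "level": 1},
    {"name": "y", "overlap": 5, "level": 2},
    {"name": "z", "overlap": 5, "level": 0},
]
order = [b["name"] for b in prioritize(pending)]
```

Here `z` and `y` tie on overlap, so the lower level wins and `z` runs first, followed by `y` and then `x`.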


Time series workloads with TTL 


UCS supports whole table expiration. 
Level is a proxy for age. 
UCS avoids unnecessarily mixing SSTables of different age in compactions. 


Higher-fan-factor UCS (e.g. with scaling parameter T20) works pretty well. 


Changing scaling parameters 


As the strategy is stateless, it simply switches to a different target state.
New compactions may be triggered. 
Work already done is still beneficial. 


Splitting/sharding is not affected. 


Upgrade from LCS and STCS 


UCS has corresponding scaling parameters: L10 for LCS default, T4 for STCS default. 
Density understands the progression of data in both. 
Overlap tracking allows trigger decisions to work correctly both initially and in mixed states.


Upgrading from LCS may trigger some compactions. 


Further extensions / future work 


e Target size growth: Control how much growth should be taken by the shard/SSTable 
count vs. the SSTable size. 

e Time-based levels: Allow a time component in levelling decisions, to fully cover TWCS 
applications, and to also support modes like “compact everything together every week” 
for controlling tombstone numbers. 

e Adaptive compaction: Measure read/write load and costs of reads and writes and 
change scaling parameters to optimize resource usage. 


Thank you! 


